State-of-the-art speaker verification frameworks have typically focused on speech enhancement techniques with increasingly deeper (more layers) and wider (more channels) models to improve verification performance. Instead, this paper proposes increasing the model's resolution capability by using attention-based dynamic kernels in a convolutional neural network, so that the model parameters are conditioned on the input features. The attention weights on the kernels are further distilled by channel attention and multi-layer feature aggregation to learn global features from speech. This approach provides an efficient way to improve representation capacity with fewer data resources, because the structure of the model parameters adapts itself to the inputs. The proposed dynamic convolutional model achieved 1.62% EER and 0.18 miniDCF on the VoxCeleb1 test set, a 17% relative improvement over ECAPA-TDNN.
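To make the idea of attention-based dynamic kernels concrete, the following is a minimal sketch of one common formulation of dynamic convolution: several candidate kernels are mixed with input-dependent attention weights before a single convolution is applied. It is not the paper's implementation. The function name `dynamic_conv1d`, the single linear attention layer, and all shapes are assumptions made for illustration, and the channel attention and multi-layer feature aggregation mentioned in the abstract are not modelled.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over a 1-D vector."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()


def dynamic_conv1d(x, kernels, attn_weights, attn_bias):
    """Hypothetical dynamic 1-D convolution (illustrative only).

    x:            (C_in, T) input feature map
    kernels:      (K, C_out, C_in, width) bank of K candidate kernels
    attn_weights: (K, C_in) linear attention layer (an assumption)
    attn_bias:    (K,)
    Returns the output (C_out, T - width + 1) and the attention vector (K,).
    """
    # Global average pooling over time summarises the input per channel.
    pooled = x.mean(axis=1)
    # Input-conditioned attention over the K candidate kernels.
    pi = softmax(attn_weights @ pooled + attn_bias)
    # Aggregate the kernel bank into one kernel for this particular input.
    kernel = np.tensordot(pi, kernels, axes=1)  # (C_out, C_in, width)
    c_out, _, width = kernel.shape
    t_out = x.shape[1] - width + 1
    y = np.zeros((c_out, t_out))
    for t in range(t_out):
        # Cross-correlation of the aggregated kernel with one time window.
        y[:, t] = np.tensordot(kernel, x[:, t:t + width], axes=([1, 2], [0, 1]))
    return y, pi


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((3, 10))
    kernels = rng.standard_normal((4, 2, 3, 3))
    y, pi = dynamic_conv1d(x, kernels, rng.standard_normal((4, 3)), np.zeros(4))
    print(y.shape, pi)
```

Because the kernels are mixed before the convolution, the cost at inference is close to that of a single static convolution, while the effective kernel still changes with each input.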
End-to-end automatic speech recognition (ASR) models aim to learn a generalised speech representation to perform recognition. In this domain, little research has analysed internal representation dependencies and their relationship to modelling approaches. This paper investigates cross-domain language model dependencies within transformer architectures using singular vector canonical correlation analysis (SVCCA) and uses these insights to exploit modelling approaches. It was found that specific neural representations within the transformer layers exhibit correlated behaviour which impacts recognition performance. Altogether, this work provides an analysis of the modelling approaches affecting contextual dependencies and ASR performance, and can be used to create or adapt better-performing end-to-end ASR models and also for downstream tasks.
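As a rough illustration of how SVCCA compares two sets of layer activations, the sketch below first reduces each activation matrix with an SVD and then computes canonical correlations between the reduced subspaces. It follows the general SVCCA recipe rather than the specific setup of this paper. The function name `svcca`, the 99% variance threshold, and the use of the mean correlation as the similarity score are assumptions.

```python
import numpy as np


def svcca(acts_a, acts_b, var_kept=0.99):
    """Mean canonical correlation between two activation matrices.

    acts_a, acts_b: (n_samples, n_neurons) activations recorded on the
    same inputs; the two neuron counts may differ.
    """
    def reduce(acts):
        # Centre each neuron, then keep the top singular directions that
        # explain `var_kept` of the variance (the "SV" step).
        centered = acts - acts.mean(axis=0, keepdims=True)
        u, s, _ = np.linalg.svd(centered, full_matrices=False)
        energy = np.cumsum(s ** 2) / np.sum(s ** 2)
        k = min(int(np.searchsorted(energy, var_kept)) + 1, len(s))
        return u[:, :k] * s[:k]

    a, b = reduce(acts_a), reduce(acts_b)
    # Orthonormal bases of the two subspaces; the singular values of their
    # cross-product are the canonical correlations (the "CCA" step).
    qa, _ = np.linalg.qr(a)
    qb, _ = np.linalg.qr(b)
    corrs = np.linalg.svd(qa.T @ qb, compute_uv=False)
    return float(np.clip(corrs, 0.0, 1.0).mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    layer = rng.standard_normal((200, 10))
    rotation, _ = np.linalg.qr(rng.standard_normal((10, 10)))
    print("rotated copy:", svcca(layer, layer @ rotation))
    print("unrelated:   ", svcca(layer, rng.standard_normal((200, 10))))
```

A score near 1 means the two layers span nearly the same subspace up to a linear transform, which is why the measure is suited to comparing representations whose individual neurons are not aligned.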
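Multilingual speech recognition has drawn significant attention as an effective way to compensate for data scarcity in low-resource languages. End-to-end (E2E) modelling is preferred over conventional hybrid systems, mainly because it has no lexicon requirement. However, hybrid DNN-HMMs still outperform E2E models in limited-data scenarios. Furthermore, the problem of manual lexicon creation has been alleviated by publicly available trained grapheme-to-phoneme (G2P) models and text-to-IPA transliteration for many languages. In this paper, a novel approach to fusing hybrid DNN-HMM acoustic models is proposed in a multilingual setup for low-resource languages. Posterior distributions from different monolingual acoustic models, computed against a target-language speech signal, are fused together. A separate regression neural network is trained for each source-target language pair to transform posteriors from the source acoustic model to the target language. These networks require very limited data compared with ASR training. Posterior fusion yields relative gains of 14.65% and 6.5% over the multilingual and monolingual baselines, respectively. Cross-lingual model fusion shows that comparable results can be achieved without using posteriors from the language-dependent ASR.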
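Multilingual automatic speech recognition (ASR) systems mostly benefit low-resource languages, but their performance degrades across several languages relative to their monolingual counterparts. Limited research has focused on understanding language behaviour in multilingual speech recognition setups. In this paper, a novel data-driven approach is proposed to investigate cross-lingual acoustic-phonetic similarities. The technique measures the similarity between the posterior distributions produced by various monolingual acoustic models for a target speech signal. Deep neural networks are trained as mapping networks to transform the distributions from different acoustic models into a directly comparable form. The analysis shows that the closeness of languages cannot be truly estimated from the size of their overlapping phoneme set. Entropy analysis of the proposed mapping networks shows that a language with a smaller overlap can be more amenable to cross-lingual transfer and hence more beneficial in the multilingual setup. Finally, the proposed posterior transformation approach is leveraged to fuse monolingual models for a target language. A relative improvement of about 8% over the monolingual counterpart is achieved.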
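For speech emotion datasets, it has been difficult to acquire large quantities of reliable data, and acted emotions may be over the top compared with the less expressive emotions displayed in everyday life. Recently, larger datasets with natural emotions have been created. Instead of ignoring the smaller, acted datasets, this study investigates whether information learnt from acted emotions is useful for detecting natural emotions. Cross-corpus research has mostly considered cross-lingual and even cross-age datasets, and difficulties arise from the different methods of annotating emotions, which cause a drop in performance. To be consistent, four adult English datasets covering acted and natural emotions are considered. A state-of-the-art model is proposed to accurately investigate the degradation in performance. The system involves a bidirectional LSTM with an attention mechanism to classify emotions across datasets. Experiments study the effects of training models in a cross-corpus and multi-domain fashion, and the results show that the transfer of information is not successful. Out-of-domain models followed by adaptation to the missing dataset, and domain adversarial training (DAT), are shown to be more suitable for generalising emotions across datasets. This shows positive information transfer from acted datasets to those with more natural emotions, as well as the benefits of training on different corpora.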
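Speech dereverberation is an important stage in many speech technology applications. Recent work in this area has been dominated by deep neural network models. Temporal convolutional networks (TCNs) are deep learning models that have been proposed for sequence modelling in the task of dereverberating speech. In this work, a weighted multi-dilation depthwise-separable convolution is proposed to replace the standard depthwise-separable convolutions in TCN models. The proposed convolution enables the TCN to dynamically focus on more or less local information in its receptive field at each convolutional block in the network. It is shown that this weighted multi-dilation temporal convolutional network (WD-TCN) consistently outperforms the TCN across various model configurations, and that using the WD-TCN model is a more parameter-efficient way to improve performance than increasing the number of convolutional blocks. The best performance improvement over the baseline TCN is 0.55 dB scale-invariant signal-to-distortion ratio (SISDR), and the best-performing WD-TCN model attains 12.26 dB SISDR on the WHAMR dataset.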
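Speech dereverberation is often an important requirement in robust speech processing tasks. Supervised deep learning (DL) models give state-of-the-art performance for single-channel speech dereverberation. Temporal convolutional networks (TCNs) are commonly used for sequence modelling in speech enhancement tasks. A feature of TCNs is that they have a receptive field (RF), dependent on the specific model configuration, which determines the number of input frames that can be observed to produce an individual output frame. It has been shown that TCNs are capable of dereverberating simulated speech data; however, a thorough analysis, especially one focused on the RF, is still lacking in the literature. This paper analyses dereverberation performance as a function of the model size and the RF of TCNs. Experiments using the WHAMR corpus, extended to include room impulse responses (RIRs) with larger T60 values, demonstrate that a larger RF can significantly improve performance when training smaller TCN models. It is also demonstrated that TCNs benefit from a wider RF when dereverberating RIRs with larger RT60 values.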
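The dependence of the receptive field on the model configuration can be illustrated with a short calculation. The sketch below assumes a Conv-TasNet-style TCN in which each stack contains blocks whose dilation doubles from one block to the next (1, 2, 4, ...) and resets at the start of every stack. That layout, the function name `tcn_receptive_field`, and its parameter names are assumptions for illustration, since the abstract does not specify the configuration.

```python
def tcn_receptive_field(kernel_size, blocks_per_stack, num_stacks):
    """Receptive field, in input frames, of a stacked dilated TCN.

    Assumes one dilated convolution per block, with dilation 2**i in the
    i-th block of each stack (an assumed layout, not taken from the paper).
    """
    rf = 1  # a single output frame always sees at least one input frame
    for _ in range(num_stacks):
        for i in range(blocks_per_stack):
            dilation = 2 ** i
            # Each dilated convolution widens the field by (P - 1) * dilation.
            rf += (kernel_size - 1) * dilation
    return rf


if __name__ == "__main__":
    for blocks in (4, 6, 8):
        frames = tcn_receptive_field(kernel_size=3, blocks_per_stack=blocks, num_stacks=3)
        print(f"{blocks} blocks per stack -> {frames} frames")
```

Under these assumptions the loop reduces to the closed form 1 + R(P - 1)(2^X - 1) for kernel size P, X blocks per stack and R stacks. For example, P = 3, X = 8 and R = 3 give 1531 frames, so adding blocks to a stack widens the RF far faster than adding stacks.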
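Training of speech enhancement systems often does not incorporate knowledge of human perception and can therefore lead to unnatural-sounding results. Incorporating psychoacoustically motivated speech perception metrics into model training via a predictor network has recently gained interest. However, the performance of such predictors is limited by the distribution of metric scores that appear in the training data. In this work, we propose MetricGAN+/- (an extension of MetricGAN+, one such metric-motivated system), which introduces an additional network, a "de-generator", that attempts to improve the robustness of the prediction network (and, by extension, of the generator) by ensuring that a wider range of metric scores is observed in training. Experimental results on the VoiceBank dataset show a relative improvement in PESQ score of 3.8% (3.05 vs 3.22 PESQ score), as well as better generalisation to unseen noise and speech.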
View-dependent effects such as reflections pose a substantial challenge for image-based and neural rendering algorithms. Above all, curved reflectors are particularly hard, as they lead to highly non-linear reflection flows as the camera moves. We introduce a new point-based representation to compute Neural Point Catacaustics allowing novel-view synthesis of scenes with curved reflectors, from a set of casually-captured input photos. At the core of our method is a neural warp field that models catacaustic trajectories of reflections, so complex specular effects can be rendered using efficient point splatting in conjunction with a neural renderer. One of our key contributions is the explicit representation of reflections with a reflection point cloud which is displaced by the neural warp field, and a primary point cloud which is optimized to represent the rest of the scene. After a short manual annotation step, our approach allows interactive high-quality renderings of novel views with accurate reflection flow. Additionally, the explicit representation of reflection flow supports several forms of scene manipulation in captured scenes, such as reflection editing, cloning of specular objects, reflection tracking across views, and comfortable stereo viewing. We provide the source code and other supplemental material on https://repo-sam.inria.fr/fungraph/neural_catacaustics/
Edge computing is changing the face of many industries and services. Common edge computing models offload computation, which makes them prone to security risks and privacy violations. However, advances in deep learning have enabled Internet of Things (IoT) devices to make decisions and run cognitive tasks locally. This research introduces a decentralized-control edge model in which most computation and decisions are moved to the IoT level. The model aims to decrease communication with the edge, which in turn enhances efficiency and decreases latency. The model also avoids data transfer, which raises security and privacy risks. To examine the model, we developed SAFEMYRIDES, a scene-aware ridesharing monitoring system in which smartphones detect violations at runtime. Current real-time monitoring systems are costly and require continuous network connectivity. The system uses optimized deep learning that runs locally on IoT devices to detect violations in ridesharing and record violation incidents. The system would enhance safety and security in ridesharing without violating privacy.